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Abstract 



cn 

o 

^ We introduce the Randomized Dependence Coefficient (RDC), a measure of non- 

t— 5 linear dependence between random variables of arbitrary dimension based on the 

*y-v Hirschfeld-Gebelein-Renyi Maximum Correlation Coefficient. RDC is defined in 

terms of correlation of random non-linear copula projections; it is invariant with 

I— I respect to marginal distribution transformations, has low computational cost and 

H^ is easy to implement: just five lines of R code, included at the end of the paper. 



1 Introduction 



Measuring statistical dependence between random variables is a fundamental problem in statistics. 
Commonly used measures of dependence, Pearson's rho. Spearman's rank or Kendall's tau are com- 
putationally efficient and theoretically well understood, but consider only a limited class of asso- 
ciation patterns, like linear or monotonically increasing functions. The development of non-linear 

r — dependence measures is challenging because of the radically larger amount of possible association 

[ — patterns. 

■^ Despite these difficulties, many non-linear statistical dependence measures have been developed 

^^ recently. Examples include the Alternating Conditional Expectations or backfitting algorithm (ACE) 

^^ lEllSl^ Kernel Canonical Correlation Analysis (KCCA) ([T], (Copula) Maximum Mean Discrepancy 

^ (MMD, CMMD in their HSIC formulations) ll6]|5l[T31, Distance or Brownian Correlation (dCor) 

^ 1 24, 23 1 and the Maximal Information Coefficient (MIC) |18|. However, these methods exhibit 

• '~j high computational demands (at least quadratic costs in the number of samples for KCCA, MMD, 

/\ CMMD, dCor or MIC), are limited to measuring dependencies between scalar random variables 

(ACE, MIC), show poor performance under the existence of additive noise (MIC) or can be difficult 

to implement (ACE, MIC). 

This paper develops the Randomized Dependence Coefficient (RDC), an estimator of the Hirschfeld- 
Gebelein-Renyi Maximum Correlation Coefficient (HGR) addressing the issues listed above. RDC 
defines dependence between two random variables as the largest canonical correlation between ran- 
dom non-linear projections of their respective empirical copula-transformations. RDC is invariant 
to monotonically increasing transformations, operates on random variables of arbitrary dimension, 
and has computational cost of O(nlogn) with respect to the sample size. Moreover, it is easy to 
implement: just five lines of R code, included in Appendix [A] 

The following Section reviews the classic work of Alfred Renyi 1 17 1, who proposed seven desirable 
fundamental properties of dependence measures, proved to be satisfied by the Hirschfeld-Gebelein- 
Renyi's Maximum Correlation Coefficient (HGR). Section [3] introduces the Randomized Depen- 
dence Coefficient as an estimator designed in the spirit of HGR, since HGR itself is computationally 
intractable. Properties of RDC and its relationship to other non-linear dependence measures are 
analysed in Sectionffl SectionBlvalidates the empirical performance of RDC on a series of numeri- 
cal experiments on both synthetic and real-world data. 



2 Hirschfeld-Gebelein-Renyi's Maximum Correlation Coefficient 

In 1959 1 17 1, Alfred Renyi argued that a measure of dependence p* : A" x 3^ — > [0, 1] between 
random variables X E X and Y E y should satisfy seven fundamental properties: 

1 . p* {X, Y) is defined for any pair of non-constant random variables X and Y. 

2. p*{X,Y)=p*{Y,X) 

3. 0<p*{X,Y) < 1 

4. p* {X, Y) = iff X and Y are statistically independent. 

5. For bijective Borel-measurable functions /, g : M ^ M, p* {X, Y) = p* {f{X),g{Y)). 

6. p* {X, y) = 1 if for Borel-measurable functions f or g,Y — f{X) or X = g{Y). 

7. If {X, Y) - Af{fj., S), then p*{X, Y) = \p{X, Y)\, where p is the correlation coefficient. 

Renyi also showed the Hirschfeld-Gebelein-Renyi Maximum Correlation Coefficient (HGR) ISl fTTJI 
to satisfy all these properties. HGR was defined by Gebelein in 1941 |i3| as the supremum of Pear- 
son's correlation coefficient p over all Borel-measurable functions /, g of finite variance: 

hgr(X,y)=supp(/(X),5(r)), (1) 

Since the supremum in ([T]i is over an infinite-dimensional space, HGR is not computable. It is 
an abstract concept, not a practical dependence measure. In the following we propose a scalable 
estimator with the same structure as HGR: the Randomized Dependence Coefficient. 



3 Randomized Dependence Coefficient 

The Randomized Dependence Coefficient (RDC) measures the dependence between random samples 
X E M^^" and Y E M"^^" as the largest canonical correlation between k randomly chosen non- 



linear projections of their copula transformations. Before Section 3.4 defines this concept formally, 



we describe the three necessary steps to construct the RDC statistic: copula-transformation of each 
of the two random samples (Section 3. 1 1, projection of the copulas through k randomly chosen non- 
linear maps (Section 3.2 1 and computation of the largest canonical correlation between the two sets 
of non-linear random projections (Section [33] l. Figure [T] offers a sketch of this process. 
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Figure 1: RDC computation for a simple set of samples {{xi, yi)}}21i drawn from a noisy circular 
pattern: The samples are used to estimate the copula, then mapped with randomly drawn non-linear 
functions. The RDC is the largest canonical correlation between these non-linear projections. 



3.1 Estimation of Copula-Transformations 

To achieve invariance with respect to transformations on marginal distributions (such as shifts or 
rescalings), we operate on the empirical copula transformation of the data lfT4l[T5l . Consider a ran- 
dom vector X = {Xi, . . . , Xd) with continuous marginal cumulative distribution functions (cdfs) 
Pi,l <i <d. Then the vector U = {Ui,..., Ud) := P{X) = {Pi{Xi), ..., Pd{Xd)), known as 
the copula transformation, has uniform marginals: 



Theorem 1. (Probability Integral Transform [14]) For a random variable X with cdfP, the random 
variable U := P{X) is uniformly distributed on [0, 1]. 

The random variables Ui, . . . ,Ud are known as the observation ranks of Xi, . . . , Xd- Crucially, 
U preserves the dependence structure of the original random vector X, but ignores each of its d 
marginal forms l:14J . The joint distribution of U is known as the copula of X: 

Theorem 2. (Sklar 4201/ ) Let the random vector X — (Xi , . . . , X^) have continuous marginal cdfs 
Pi, I < i < d. Then, the joint cumulative distribution of X is uniquely expressed as: 

P{Xi, ...,Xd)^ C{Pi{Xi)., ..., Pd{Xd)), (2) 

where the distribution C is known as the copula of X. 

A practical estimator of the univariate cdfs Pi , . . . , P^; is the empirical cdf: 

1 " 
P„(a;):=-Vl(X, <a;), (3) 

n ^-^ 
i=i 

which gives rise to the empirical copula transformations of a multivariate sample: 

P„(a3) = [P„a(a;i),...,P„,d(xd)]. (4) 

The Massart-Dvoretzky-Kiefer-Wolfowitz inequality 1 13| can be used to show that empirical copula 
transformations converge fast to the true transformation as the sample size increases: 

Theorem 3. (Convergence of the empirical copula, 4751 Lemma 7]) Let Xi, . . . , X.^ be an i.i.d. 
sample from a probability distribution over M.'^ with marginal cdfs Pi, ... , P^. Let P{X) be the 
copula transformation and Pn{X) the empirical copula transformation. Then, for any e > 0." 

1 / 2me^\ 
Pr sup ||P(a;)-P„(a;)||2 >e < 2dexp -— . (5) 

Computing P„(JC) involves sorting the marginals of X e M''^", thus 0{dn\og{n)) operations. 

3.2 Generation of Random Non-Linear Projections 

The second step of the RDC computation is to augment the empirical copula transformations with 
non-linear projections, so that linear methods can subsequently be used to capture non-linear depen- 
dencies on the original data. This is a classic idea also used in other areas, particularly in regression. 
In an elegant result, Rahimi and Brecht |16| proved that linear regression on random, non-linear 
projections of the original feature space can generate high-performance regressors: 

Theorem 4. (Rahimi-Brecht) Let p be a distribution on fl and \(j)(^x;w)\ < 1. Let T = 
{ f{x) = Jjj a{w)(f>{x; w)dw\ \a{w)\ < Cp{w)y Draw Wi, . . . , Wk iid from p. Further let 
5 > Q, and c be some L-Lipschitz loss function, and consider data {a;^, 2/i}"^i drawn iid from some 
arbitrary P(X, Y). The ai, . . . , a^for which fk{x) = X]i=i 0Li4>{x; Wi) minimizes the empirical 
risk c{fk{x), y) has a distance from the c-optimal estimator in !F bounded by 

Ep[c(A.(a;),2/)] -mmEp[c(/(a;),2/)] <oi(-^ + ^ LC\j\og^\ (6) 

with probability at least 1 — 25. 

Intuitively, Theorem W states that randomly selecting Wi in X]i=i '^i'Pi^i ''^i) instead of optimising 
them causes only bounded error. 

The choice of the non-linearities (/) : M ^- M is the main, unavoidable assumption in RDC. This 
choice is a well-known problem common to all non-linear regression methods and has been studied 
extensively in the theory of regression as the selection of reproducing kernel Hilbert space lfT9l 
§3.13]. The choice of the family (space) of features, and of probability distributions over it, is 
unlimited. The only way to favour one such family and distribution over another is to use prior 
assumptions about which kind of distributions the method will typically have to analyse. 



Features popular in parts of the literature are sigmoids, parabolas, radial basis functions, com- 
plex sinusoids, sines or cosines. In our experiments, we will use sine and cosine projections, 
<j){w'^x + b) :— {cos{w'^x + b),sm{w'^x + b)). Arguments favouring this choice are that shift- 
invariant kernels are approximated with these features when using the appropriate random param- 
eter sampling distribution lfT6l .ll4l p. 208] ll22l p. 24], and that functions with absolutely integrable 
Fourier transforms are approximated with L2 error below 0(1/ Vk) by k of these features [ilOil . 

Let the random parameters tijj ^ A/'(0, sJ), 6^ ^ U[—Tr,TT]. Choosing lOi to be Normal is analogous 
to the use of the Gaussian kernel for MMD, CMMD or KCCA fW^. Tuning s is analogous to 
selecting the kernel width, that is, to regularize the non-linearity of the random projections. 

Given a data collection X = {xi, . . . , a;„), we will denote by 

b{wfxi+bi) ■■■ (t){wlxi+bk) 
^{X-k,s):^\ : : : I (7) 



(j){w( x,, + bi) 



Jj. ^n 



the fc— th order random non-linear projection from X g M."^^" to ^j^^ :— ^{X; k, s) € M^*^^". 

The computational complexity of computing ^j^" using naive matrix multiplications is 0{kdn). 
However, recent techniques 1 1 1 1 allow computing these feature expansions within a computational 
cost of 0{k\og{d)n) using 0{k) storage. 

3.3 Computation of Canonical Correlations 

The final step of RDC is to compute the linear combinations of the augmented empirical copula 
transformations that have maximal correlation. Canonical Correlation Analysis (CCA, |7|) is the 
calculation of pairs of basis vectors (q;,/3) such that the projections oP^ X and 0^Y of two ran- 
dom samples X G M^^" and Y £ M'?^" are maximally correlated. The correlations between the 
projected (or canonical) random samples are referred to as canonical correlations. There exist up to 
max (rank ( X ), rank (1^)) of them. Canonical correlations p^ are the solutions to the eigenproblem: 

C-,^C,y \fa\_.fa 



C-^^Cy. J{f3 J-P [f3 )^ ^8> 

where Cxy — cov(X, Y) and the matrices Cxx and Cyy are assumed to be invertible. Therefore, 
the largest canonical correlation pi between X and Y is the supremum of the correlation coefficients 
over their linear projections, that is: pi{X,Y) — sup ^ ^^ p{a'^X,f3'^Y). 

When p,q -^ n, the cost of CCA is dominated by the estimation of the matrices Cxx, Cyy and Cxy, 
hence being 0{{p + qf'n) for two random variables of dimensions p and q, respectively. 

3.4 Formal Definition or RDC 

Given the random samples X e M^^" and Y e M''^" and the parameters k G N+ and s e M+, the 
Randomized Dependence Coefficient between X and Y is defined as: 

rdc(X,r;fc,5) := supp(a^*^^^ /3^*;^^y.) . (9) 

4 Properties of RDC 

Computational complexity: In the typical setup (very large n, large p and q, small fc) the compu- 
tational complexity of RDC is dominated by the calculation of the copula-transformations. Hence, 
we achieve a cost in terms of the sample s,\ztofO{{p+q)n\ogn + kn\og{pq) + k^n) « 0(71 log n). 

Ease of implementation: An implementation of RDC in R is included in the Appendix [A] 

Relationship to the HGR coefficient: It is tempting to wonder whether RDC is a consistent, or 
even an efficient estimator of the HGR coefficient. However, a simple experiment shows that it is not 



desirable to approximate HGR exactly on finite datasets: Consider p(X, Y) — J\f{x; 0, l)Af{y; 0, 1) 
which is independent, thus, by both Renyi's 4th and 7th properties, has hgr(X, Y) — 0. How- 
ever, for finitely many N samples from p{X, Y), almost surely, values in both X and Y are 
pairwise different and separated by a finite difference. So there exist continuous (thus Borel 
measurable) functions f{X) and g{Y) mapping both X and Y to the sorting ranks of Y, i.e. 
fi^i) = giUi) "^{^iiVi) G (^j^)- Therefore, the finite-sample version of Equation ([T]) is con- 
stant and equal to "1" for continuous random variables. Meaningful measures of dependence from 
finite samples thus must rely on some form of regularization. RDC achieves this by approximating 
the space of Borel measurable functions with the restricted function class F from TheoremH] 

Assume the optimal transformations / and g (Equation 1) to belong to the Reproducing Kernel 
Hilbert Space F (Theorem 4), with associated shift-invariant, positive semi-definite kernel function 

k{x, x') = {<p{x), (f){x'))jr < 1. Then, with probability greater than 1 — 26: 



hgr(X, Y; F) - rdc(X, Y; k) 




(10) 



where m := aoc^ + (30^ and n, k denote the sample size and number of random features. The 
bound (10 1 is the sum of two errors. The error 0{1/-Jn) is due to the convergence of CCA's 
largest eigenvalue in the finite sample size regime. This result [8, Theorem 6] is originally ob- 
tained by posing CCA as a least squares regression on the product space induced by the feature map 
'4>{x^y) = [(p{x)(f){x)'^ , (f){y)(f){y)'^ , ^/2<p{x)(j){y)'^]^ . Because of approximating t/j with fc ran- 
dom features, an additional error 0{l/^/k) is introduced in the least squares regression 1.16^ Lemma 
3]. Therefore, an equivalence between RDC and KCCA is established if RDC uses an infinite num- 
ber of sine/cosine features, the random sampling distribution is set to the inverse Fourier transform 
of the shift-invariant kernel used by KCCA and the copula-transformations are discarded. However, 
when k > n regularization is needed to avoid spurious perfect correlations, as discussed above. 

Relationship to other estimators: Table [T] summarizes several state-of-the-art dependence mea- 
sures showing, for each measure, whether it allows for general non-linear dependence estimation, 
handles multidimensional random variables, is invariant with respect to changes in marginal distri- 
butions, returns a statistic in [0, 1], satisfy Renyi's properties (Sectionl2|, and how many parameters 
it requires. As parameters, we here count the kernel function for kernel methods, the basis function 
and number of random features for RDC, the stopping tolerance for ACE and the search-grid size for 
MIC, respectively. Finally, the table lists computational complexities with respect to sample size. 



Table 1: Comparison between non-linear dependence measures. 
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Testing for independence with RDC: Consider the hypothesis "the two sets of non-linear pro- 
jections are mutually uncorrected". Under normality assumptions (or large sample sizes), Bartlett's 
approximation [12J can be used to show: 



/2fc 



\og\{{l~pj) 



Xfe2, 



(11) 



1 = 1 



where pi, . . . , pk are the canonical correlations between the two sets of non-linear projections 
*pfxi ^^'^ ^p(Y)' Alternatively, non-parametric asymptotic distributions can be obtained from 
the spectrum of the inner products of the non-linear random projection matrices ll25l Theorem 3]. 

5 Experimental Results 

We performed numerical experiments on both synthetic and real-world data to validate the empir- 
ical performance of RDC versus the non-linear dependence measures listed in Table [T] In some 
experiments, we don't compare against to KCCA due its prohibitive running times (see Tablel2]i. 

Parameter selection: The number of random features for RDC was set to fc = 10 symmetri- 
cally for both random samples, since no significant improvements were observed for larger values. 
However, this parameter can be set to the largest value that fits within the available computational 
budget. The random sampling parameters {sx,sy) were set independently for each of the two 
random samples, equal to their squared euclidean distance empirical median |5|. Competing kernel 
methods make use of Gaussian RBF kernels of the form k{x, x'; sx) = exp{—\\x — x'W^ /sx) 
for the random variable X and analogously for the random variable Y. For the MIC statistic, the 
search-grid size is set to B{n) = n"-^, as recommended by the authors fTS |. The stopping tolerance 
for ACE is set to e = 0.01, the default value in the R package acepackQ 

5.1 Synthetic Data 

Resistance to additive noise: We define the power of a dependence measure as its ability to 
discern between dependent and independent samples that share equal marginal forms. In the spirit 
of Simon and Tibshiranr] we conducted experiments to estimate the power of RDC as a measure 
of non-linear dependence. We chose 8 bivariate association patterns, depicted inside little boxes in 
Figure |3] For each of the 8 association patterns, 500 repetitions of 500 samples were generated, 
in which the input variable was uniformly distributed on the unit interval. Next, we regenerated 
the input variable randomly, to generate independent versions of each sample with equal marginals. 
Figure |3] shows the power for the discussed non-linear dependence measures as the variance of some 
zero-mean Gaussian additive noise increases from 1/30 to 3. RDC shows worse performance in 
the linear association pattern due to noise overfitting and in the step-function due to the smoothness 
prior induced by the use of sine/cosine basis functions, but has good performance in non-functional 
association patterns (such as the circle and the mixture of sinusoidal waves). 

Running times: Table |2] summarizes running times (in seconds) for the considered non-linear 
dependence measures on scalar, uniformly distributed, independent samples of sizes {10'^, . . . , 10®} 
when averaging over 100 runs. Single runs above ten minutes were cancelled (empty cells in table). 
In this comparison, Pearson's p, ACE, dCor and MIC are using compiled C code, while RDC, along 
with MMD, CMMD and KCCA are implemented as inteipreted R code. 

Table 2: Average running times (in seconds) for dependence measures on versus sample sizes. 



sample size 


Pearson's p 


RDC 


ACE 


dCor 


MMD 


CMMD 


MIC 


KCCA 


1,000 


0.0001 


0.0047 


0.0080 


0.3417 


0.3103 


0.3501 


1.0983 


166.29 


10,000 


0.0002 


0.0557 


0.0782 


59.587 


27.630 


29.522 


— 


— 


100,000 


0.0071 


0.3991 


0.5101 


— 


— 


— 


— 


— 


1,000,000 


0.0914 


4.6253 


5.3830 


— 


— 


— 


— 


— 



Value of statistic in [0, 1]: Figured shows RDC, ACE, dCor, MIC, Pearson's p. Spearman's rank 
and Kendall's r dependence estimates for 14 different associations of two scalar random variables. 
RDC scores values close to one on all the proposed dependent associations, whilst scoring values 
close to zero for the independent association, depicted last. When the associations are Gaussian (first 



■ http : //cran . r-project . orq/web/packaqes/acepack/index . h tml 



http : //www- Stat . Stanford. edu/ ~tibs/reshef /comment .pdf 



row), RDC scores values close to the Pearson's correlation coefficient, as suggested in the seventh 
property of Renyi (Section|2]i. 

5.2 Feature Selection in Real- World Data 

We performed greedy feature selection via dependence maximization II2TI on eight real-world 
datasets. More specifically, we attempted to construct the subset of features G C X that mini- 
mizes the normalized mean squared regression error (NMSE) of a Gaussian process regressor. We 
do so by selecting the feature a;*^*-' maximizing dependence between the feature set CJ,j = {Gi-i,x^^^ 
and the target variable y at each iteration i G {1, . . . 10}, such that Qq = {0} and x^*' ^ Gi-i- 

We considered 12 heterogeneous datasets, obtained from the UCI dataset repositorjr] the Gaus- 
sian process web site Datajand the Machine Learning data set repostitorjrl Random training/test 
partitions are computed to be disjoint and equal sized. 

Since Q can be multi-dimensional, we compare RDC to the non-linear methods dCor, MMD and 
CMMD. Given their quadratic computational demands, dCor, MMD and CMMD use up to 1, 000 
points when measuring dependence; this constraint only applied on the sarcos and calcensus 
datasets. Results are averages of 20 random training/test partitions. 
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Figure 2: Feature selection experiments on real- world datasets. 

Figure |2] summarizes the results for all datasets and algorithms as the number of selected features 
increases. RDC performs best in most datasets, with much lower running time than its contenders. 

6 Conclusion 



We have presented the randomized dependence coefficient, a lightweight non-linear measure of 
dependence between multivariate random samples. Constructed as a finite-dimensional estimator in 
the spirit of the Hirschfeld-Gebelein-Renyi maximum correlation coefficient, RDC performs well 
empirically, is scalable to very large datasets, and is easy to adapt to concrete problems. 
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Figure 3: Power of discussed measures on several bivariate association patterns as noise increases. 
Insets show the noise-free form of each association pattern. 
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Figure 4: RDC, ACE, dCor, MINE, Pearson's p. Spearman's rank and Kendall's t estimates (num- 
bers in tables above plots, in that order) for several bivariate association patterns. 



A R Source Code 



rdc <- f unction (x, Y, k, s ) { 

X <- cbind (apply (as .matrix (x) , 2, function (u) ecdf (u) (u) ) , 1) 

y <- cbind (apply (as .matrix (y) , 2, function (u) ecdf (u) (u) ) , 1) 

wx <- matrix (rnorm (ncol (x) *k, 0, s ), ncol (x) , k) 

wy <- matrix (rnorm (ncol (y) *k, 0, s ), ncol (y) , k) 

cancor (cbind (cos (x%*%wx) , sin (x%*%wx) ) , cbind (cos (y%*%wy) , sin (y%*%wy) ) ) $cor [ 1 ] 
} 
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